Red Wine Quality by Robert Bornemann

The goal of this report is to explore a dataset containing 1599 red wine samples and to highlight aspects of exploratory data analysis as part of the Udacity Data Analyst course. The dataset used in this report is publicly available for research through:

[@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

Univariate Plots Section

Our dataset consists of 13 variables with 1599 observations. I excluded the variable X from the analysis, as it is simply another ID column that RStudio doesn’t need, which leaves 12 variables for the analysis.

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Number of missing values in the dataset:

## [1] 0
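The output above looks like the result of the usual inspection calls. A minimal sketch of how it might be produced follows; since the file name is not given in this report, the data frame `wine` is built here from the first five rows printed by `str()` (a subset of columns), and both `wine` and `wine_clean` are hypothetical names.

```r
# First five rows as printed in the str() output above (subset of columns).
# The full report would read all 1599 rows from the data file instead.
wine <- data.frame(
  X                = 1:5,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70),
  citric.acid      = c(0, 0, 0.04, 0.56, 0),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L)
)

dim(wine)       # number of rows and columns
str(wine)       # type and first values of each column
summary(wine)   # min, quartiles, mean and max per column

# Total number of missing values across all columns
n_missing <- sum(is.na(wine))
n_missing

# Drop the redundant index column X before the analysis
wine_clean <- wine[, names(wine) != "X"]
names(wine_clean)
```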

I decided to explore the distribution of quality ratings first. The histogram reveals that most red wines in the dataset are of medium quality (μ ≈ 5.64) on a scale from 0 (very bad) to 10 (very excellent). The worst red wine has a quality score of 3, whereas the best red wine in the dataset has a score of 8. The distribution is slightly skewed to the left, which suggests a higher number of medium to higher quality wines in the dataset.

The question we want to ask is which of the following chemical properties can help us explain the quality ratings of red wines, which are based on sensory data from wine experts. Let’s try to find out!

All the chemical properties describing different acids in the three histograms above seem to be right skewed in various forms. A closer look, however, reveals that the distributions of fixed and volatile acidity are most likely skewed because of some outliers. My idea was to trim the distributions of both variables in order to bring them closer to a normal distribution for further analysis.

Moreover, I am attempting to transform the citric acid variable using the square root method in order to get a clearer view of its distribution. The distribution looks like a waveform peaking and declining at various points of the scale. Zooming in might help to understand this pattern a little better.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

A closer look at the histograms above illustrates how the distributions of fixed and volatile acidity come closer to a normal distribution when the x-axis is limited.
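The trimming and the square root transform described above can be sketched as follows. The report does not state the exact axis limits, so cutting at the 1st and 99th percentiles is an assumption of this sketch, and the skewed column is a synthetic stand-in rather than the real fixed.acidity data. The citric acid values are the first ten printed by `str()`.

```r
set.seed(1)
# Synthetic right-skewed stand-in for a column such as fixed.acidity
x <- rlnorm(1599, meanlog = log(8), sdlog = 0.2)

# Assumed trimming rule: keep values between the 1st and 99th percentiles
limits  <- quantile(x, probs = c(0.01, 0.99))
trimmed <- x[x >= limits[1] & x <= limits[2]]

hist(trimmed, breaks = 30,
     main = "Trimmed to 1st-99th percentile",
     xlab = "fixed.acidity (synthetic)")

# Square root transform as applied to citric.acid; zeros stay zeros
citric      <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
citric_sqrt <- sqrt(citric)

# Share of samples without citric acid in this ten-row excerpt
zero_share <- mean(citric == 0)
zero_share
```

Trimming for display only changes what the histogram shows; the summary statistics printed in the report are still computed on the full columns.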

Citric acid seems to be different. We can see that a significant share of red wines has very little or no citric acid. About 8% of red wines contain no citric acid, which makes adding ‘freshness’ and flavor to wines using citric acid look rather optional [Cortez et al., 2009]. On the other hand, we can also confirm the bimodal wave pattern from the first histogram and see that the counts peak around 0.25 and 0.50. Perhaps adding citric acid is a rather delicate process, and mastering this technique adds to the overall quality score? Hopefully a correlation analysis will clarify these observations further.

Looking at sugars and salts in wine, we can also observe right skewed distributions. Moreover, outliers stretch both distributions to more than double the range where the majority of the data sits. I am attempting to transform both variables using a log10 method in order to get a clearer view of the distributions, and I am limiting the histograms by ignoring the top 1% of values on the very right of the scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The transformations applied helped to understand both distributions better. While sugar still seems to be slightly skewed to the right, we can observe that chlorides peak very sharply at one particular point on the scale. It seems that there is far more variation in the amount of sugar and less so in salts. At this point I am not yet very sure about my intuition on these variables. However, I feel there might be a relationship between sugar and salt ratios and their effect on the quality rating.
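To put a number on how much a log10 transform reduces right skew, one option is to compare a moment-based skewness before and after. This is only a sketch: the data are a synthetic log-normal stand-in for residual.sugar, and the helper `skewness()` is a hypothetical function defined here, not something taken from the report.

```r
set.seed(42)
# Synthetic right-skewed stand-in for residual.sugar (not the real data)
sugar <- rlnorm(1599, meanlog = log(2.2), sdlog = 0.4)

# Moment-based sample skewness: about 0 for a symmetric distribution,
# positive when the right tail is longer
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw   <- skewness(sugar)
skew_log10 <- skewness(log10(sugar))

c(raw = skew_raw, log10 = skew_log10)
```

A value close to zero after the transform indicates that the log10 scale is a reasonable choice for plotting; a clearly positive value suggests the right tail is still present.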

As with many of the other chemical properties in the dataset, free.sulfur.dioxide and total.sulfur.dioxide are right skewed. The presence of these properties in wine is important, as it prevents microbial growth and oxidation. However, concentrations over 50 ppm become evident in the nose and taste and are therefore most likely less desirable. Perhaps this can explain some of the variance in the lower quality ranks. My intuition at this point is that the distributions are right skewed because of this.

I am also attempting to transform both variables using a log10 method in order to get a clearer view of the distributions.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Interestingly, the log10 method helped significantly to bring the variables closer to a normal distribution. While total.sulfur.dioxide shows a nearly bell-shaped pattern in the histogram, we can observe that free.sulfur.dioxide is distributed in a more scattered way. There are some outliers that contain very little or practically no free.sulfur.dioxide. On further evaluation of the same variable, we see that its presence throughout the distribution is rather irregular. However, due to the nature of the variable itself, I have very limited intuition about its significance in relation to quality beyond what I have already quoted above.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Density is almost perfectly bell-shaped in its distribution. Its value depends on the percent of alcohol and the sugar content in the red wine. I have doubts that density can explain a lot of variance in quality, but we will see how it performs in a correlation analysis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Very much like density, pH is relatively normal in its distribution. From the text file, which describes the variables and how the data was collected, I can gather that most wines are between 3 and 4 on the pH scale, which spans from 0 (very acidic) to 14 (very basic). The data confirms this information and illustrates that only lower pH values are suitable for wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates are additives which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. The first histogram shows a right skewed distribution of the data as a whole. Some outliers stretch the distribution well beyond the range where the majority of the data sits (the third quartile is 0.73, the maximum is 2.00).

I attempted to transform the variable by limiting the histogram to 99% of the data. The distribution still appeared to be skewed to the right, so I decided to apply another transformation using a log10 method.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Alcohol, I would assume, is one of the most important variables in wine. Every wine in the dataset contains at least 8.4% alcohol, and the maximum is 14.9%. The distribution is right skewed and peaks at around 9.5%, which pulls the overall mean up to μ = 10.42.

Unfortunately, I was not able to find an applicable transformation method that could bring the distribution to a more suitable shape for analysis. Limiting the data to 99% and thereby excluding some of the outliers could fix some degree of the skewness. However, the peak at 9.5% remains the main explanation for its shape in the histogram.

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wines included in the dataset with 12 features (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”). The variable quality is an integer whereas all other variables are numeric.

Other observations:

Higher number of wines from medium to higher quality.
About 8% of red wines contain no citric acid.
The median for residual.sugar is 2.200 and the max is 15.500.

What is/are the main feature(s) of interest in your dataset?

The main feature in the dataset is quality. My intuition leads me to assume that alcohol could be another main feature. However, I lack the necessary domain knowledge on chemical properties and wine to identify other main features from the analysis above without further statistical investigation.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I have a feeling that the ratio between the presence of one chemical property and the absence of another could possibly explain some of the variance in quality.

Did you create any new variables from existing variables in the dataset?

No. I lack the necessary domain knowledge on chemical properties and wine to identify appropriate opportunities in the dataset that would make such an operation useful.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I have transformed quality to a factor, as it is a categorical variable on a scale between 0 (very bad) and 10 (very excellent). Moreover, I have performed transformations on the following variables in the dataset, as many of them were right skewed: citric.acid.sqr, residual.sugar.log10, chlorides.log10, free.sulfur.dioxide.log10, total.sulfur.dioxide.log10, sulphates.log10. The transformations were successful except for citric.acid.sqr. This variable still shows a bimodal wave-pattern distribution with counts peaking around 0.25 and 0.50.
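The derived columns named above might be created along these lines. This is a sketch on a five-row excerpt built from the `str()` output; the column names follow the report, but the exact code the report used is not shown, so treat the details as assumptions.

```r
# Five-row excerpt built from the first values shown in the str() output
dataset <- data.frame(
  citric.acid    = c(0, 0, 0.04, 0.56, 0),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9),
  chlorides      = c(0.076, 0.098, 0.092, 0.075, 0.076),
  sulphates      = c(0.56, 0.68, 0.65, 0.58, 0.56),
  quality        = c(5L, 5L, 5L, 6L, 5L)
)

# Treat the rating as categorical; levels follow the observed scores
dataset$quality.factor <- factor(dataset$quality)

# citric.acid contains zeros, and log10(0) is -Inf, so the square root
# is a safer choice for that column
dataset$citric.acid.sqr      <- sqrt(dataset$citric.acid)
dataset$residual.sugar.log10 <- log10(dataset$residual.sugar)
dataset$chlorides.log10      <- log10(dataset$chlorides)
dataset$sulphates.log10      <- log10(dataset$sulphates)

str(dataset)
```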

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.26        0.67
## volatile.acidity             -0.26             1.00       -0.55
## citric.acid                   0.67            -0.55        1.00
## residual.sugar                0.11             0.00        0.14
## chlorides                     0.09             0.06        0.20
## free.sulfur.dioxide          -0.15            -0.01       -0.06
## total.sulfur.dioxide         -0.11             0.08        0.04
## density                       0.67             0.02        0.36
## pH                           -0.68             0.23       -0.54
## sulphates                     0.18            -0.26        0.31
## alcohol                      -0.06            -0.20        0.11
## quality                       0.12            -0.39        0.23
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
##                      quality
## fixed.acidity           0.12
## volatile.acidity       -0.39
## citric.acid             0.23
## residual.sugar          0.01
## chlorides              -0.13
## free.sulfur.dioxide    -0.05
## total.sulfur.dioxide   -0.19
## density                -0.17
## pH                     -0.06
## sulphates               0.25
## alcohol                 0.48
## quality                 1.00

In order to get a better visual representation of the correlations, a heatmap of correlations is shown below. Warm colors indicate negative correlations whereas cold colors indicate positive correlations.

At this point I was also wondering how the transformed variables would perform with each other and whether there are any significant improvements. The pairs.panels plot below is intended to show just that.

The heatmap above provides some interesting insights into the nature of the correlations without including any of the transformed variables. Moreover, the pairs.panels function from the psych package shows the relationships of the transformed variables.

Generally, the bivariate correlations are rather weak and also the transformed variables are not necessarily performing better. The correlation coefficient improves slightly but not in a meaningful way that would justify further investigation.
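A correlation table like the one above can be computed with `cor()` and rounded to two decimals. The sketch below uses only the first ten rows printed by `str()` for a handful of columns, so its numbers will not match the full table; Pearson correlation (the default of `cor()`) is assumed, since the report does not name the method.

```r
# First ten rows from the str() output, subset of columns
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise correlations, rounded as in the table above
cor_matrix <- round(cor(wine, method = "pearson"), 2)
cor_matrix

# Rank the predictors by the absolute size of their correlation with quality
with_quality <- cor_matrix[rownames(cor_matrix) != "quality", "quality"]
ranked <- with_quality[order(abs(with_quality), decreasing = TRUE)]
ranked
```

Sorting by absolute value puts strong negative relationships, such as the one for volatile.acidity in the full table, next to strong positive ones.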

The variables that correlate most strongly with quality are volatile.acidity and alcohol, although the relationships themselves are weak. The direction of volatile.acidity with quality is negative, meaning that wines with lower levels of volatile.acidity tend to have a better quality, which is shown in the boxplot below.

## dataset$quality.factor: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## dataset$quality.factor: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## dataset$quality.factor: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## dataset$quality.factor: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## dataset$quality.factor: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## dataset$quality.factor: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

On the other hand, alcohol has a positive relationship with quality, meaning, to put it cautiously, that higher alcohol levels tend to go along with higher quality in wine, as shown in the boxplot below. Interestingly, medium quality wine has the biggest range in alcohol. I wonder if the combination with volatile.acidity can reveal more insights?

## dataset$quality.factor: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## dataset$quality.factor: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## dataset$quality.factor: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## dataset$quality.factor: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## dataset$quality.factor: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## dataset$quality.factor: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00
## dataset$quality.factor: 3
## [1]  8.4 11.0
## -------------------------------------------------------- 
## dataset$quality.factor: 4
## [1]  9.0 13.1
## -------------------------------------------------------- 
## dataset$quality.factor: 5
## [1]  8.5 14.9
## -------------------------------------------------------- 
## dataset$quality.factor: 6
## [1]  8.4 14.0
## -------------------------------------------------------- 
## dataset$quality.factor: 7
## [1]  9.2 14.0
## -------------------------------------------------------- 
## dataset$quality.factor: 8
## [1]  9.8 14.0
## # A tibble: 6 x 4
##   quality.factor count  mean    sd
##   <fct>          <int> <dbl> <dbl>
## 1 3                 10  9.96 0.818
## 2 4                 53 10.3  0.935
## 3 5                681  9.90 0.737
## 4 6                638 10.6  1.05 
## 5 7                199 11.5  0.962
## 6 8                 18 12.1  1.22
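Per-group summaries of this kind can be produced with `by()` and `tapply()`. The following sketch rebuilds a synthetic alcohol column from the counts, means and standard deviations in the table above; drawing each group from a normal distribution is an assumption made only to have data to run on. The tibble in the output suggests a dplyr pipeline was used originally, whereas this sketch stays in base R.

```r
set.seed(123)
# Per-group statistics copied from the table above (quality levels 3 to 8)
groups <- data.frame(
  quality = 3:8,
  count   = c(10, 53, 681, 638, 199, 18),
  mean    = c(9.96, 10.3, 9.90, 10.6, 11.5, 12.1),
  sd      = c(0.818, 0.935, 0.737, 1.05, 0.962, 1.22)
)

# Synthetic data: normal draws per group (assumption of this sketch)
dataset <- data.frame(
  quality.factor = factor(rep(groups$quality, times = groups$count)),
  alcohol        = rnorm(sum(groups$count),
                         mean = rep(groups$mean, times = groups$count),
                         sd   = rep(groups$sd,   times = groups$count))
)

# Summary and range of alcohol per quality level
by(dataset$alcohol, dataset$quality.factor, summary)
by(dataset$alcohol, dataset$quality.factor, range)

# Count, mean and standard deviation per quality level
stats <- data.frame(
  quality = levels(dataset$quality.factor),
  count   = as.vector(table(dataset$quality.factor)),
  mean    = as.vector(tapply(dataset$alcohol, dataset$quality.factor, mean)),
  sd      = as.vector(tapply(dataset$alcohol, dataset$quality.factor, sd))
)
stats
```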

Looking at the relationships between the supporting variables, I can see that total.sulfur.dioxide and free.sulfur.dioxide moderately correlate with each other. According to the text file which describes the variables, this makes a lot of sense, as free.sulfur.dioxide is part of the measurement of total.sulfur.dioxide. I decided not to investigate this relationship further.

We can also see a weak to moderate negative correlation between density and alcohol. This makes sense as well, as the density of wine is close to that of water depending on the percent alcohol and sugar content.

The first scatterplot above shows the moderate negative relationship between density and alcohol of -0.50. That means as one variable increases, the other variable tends to decrease.

On the other hand, density and fixed.acidity have a moderate to strong positive relationship of 0.67. That means fixed.acidity tends to increase with density.

Both scatterplots above attempt to explore the relationships between citric.acid and pH (-0.54) as well as fixed.acidity and pH (-0.68) found in the correlation matrix. The negative relationships can be seen in the dotted scatterplot clouds. Again, I wonder how some of these variables related to acidity, pH, density and alcohol might support each other with regard to quality during the bivariate analysis.

I will focus on exploring the statistical relationship between alcohol and acidity on quality as a factor. I am showing an initial one-way ANOVA test below including plots checking homogeneity of variance and normality. I am going to start building a model with alcohol ~ factor(quality):

##                   Df Sum Sq Mean Sq F value Pr(>F)    
## factor(quality)    5  483.9   96.79   115.9 <2e-16 ***
## Residuals       1593 1330.8    0.84                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
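A one-way ANOVA of this form can be fitted with `aov()`. The sketch below uses synthetic alcohol values generated from the per-quality counts, means and standard deviations reported earlier (normal draws per group are an assumption), so the F value will differ from the output above while the degrees of freedom match.

```r
set.seed(123)
counts <- c(10, 53, 681, 638, 199, 18)           # wines per quality level 3..8
means  <- c(9.96, 10.3, 9.90, 10.6, 11.5, 12.1)  # mean alcohol per level
sds    <- c(0.818, 0.935, 0.737, 1.05, 0.962, 1.22)

# Synthetic stand-in for the real data
wine <- data.frame(
  quality = rep(3:8, times = counts),
  alcohol = rnorm(sum(counts), rep(means, times = counts), rep(sds, times = counts))
)

# One-way ANOVA: does mean alcohol differ between quality levels?
fit <- aov(alcohol ~ factor(quality), data = wine)
anova_table <- summary(fit)[[1]]
anova_table

# Diagnostic plots: residuals vs fitted (homogeneity of variance)
# and normal Q-Q plot (normality of residuals)
plot(fit, which = 1:2)
```

The diagnostic plots label the most extreme points by row number, which is one way to spot candidate outliers.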

Points 653 and 145 are detected as outliers. I want to have a look at a similar model with volatile.acidity ~ factor(quality) before I further optimize the model.

##                   Df Sum Sq Mean Sq F value Pr(>F)    
## factor(quality)    5   8.22   1.645   60.91 <2e-16 ***
## Residuals       1593  43.01   0.027                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Points 128, 127 and 1300 are detected as outliers. I will remove them along with the outliers from alcohol ~ factor(quality) and re-run the calculation in order to minimize effects on normality and homogeneity of variance.

##               Df Sum Sq Mean Sq F value Pr(>F)    
## quality        1  378.2   378.2   480.1 <2e-16 ***
## Residuals   1561 1229.6     0.8                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After subsetting to about 99% of the distribution in alcohol I could successfully remove all the relevant outliers.

##               Df Sum Sq Mean Sq F value Pr(>F)    
## quality        1   5.79   5.786   239.3 <2e-16 ***
## Residuals   1561  37.75   0.024                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After subsetting to about 99% of the distribution in volatile.acidity I could successfully remove all the relevant outliers.

## Levene's Test for Homogeneity of Variance (center = mean)
##         Df F value    Pr(>F)    
## group    5  23.013 < 2.2e-16 ***
##       1557                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Shapiro-Wilk normality test
## 
## data:  aov_residuals.alcohol
## W = 0.97013, p-value < 2.2e-16

After subsetting to about 99% of the distribution in alcohol and volatile.acidity I could successfully remove all the relevant outliers. Moreover, I ran Levene’s test for alcohol, which returned a p-value far below the significance level of 0.05 (F = 23.01, p < 2.2e-16). This means that the variance differs significantly across the quality groups, so the assumption of homogeneity of variance is violated. In addition, the Shapiro-Wilk test on the ANOVA residuals (W = 0.97, p < 0.05) shows that, even after removing the outliers, normality is violated.

## Levene's Test for Homogeneity of Variance (center = mean)
##         Df F value Pr(>F)
## group    5  1.5539 0.1701
##       1557
## 
##  Shapiro-Wilk normality test
## 
## data:  aov_residuals.volatile.acidity
## W = 0.99219, p-value = 2.287e-07

I also ran Levene’s test for volatile.acidity, which returned a p-value that is not less than the significance level of 0.05 (F = 1.55, p = 0.17), so there is no evidence that the variance differs across the quality groups. However, the Shapiro-Wilk test on the ANOVA residuals (W = 0.99, p < 0.05) shows that normality is violated here as well, indicating that the residuals are not normally distributed for either variable.
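Both assumption checks can be sketched in base R. The header of the output above resembles that of `leveneTest()` from the car package; to avoid that dependency, this sketch computes Levene's test (center = mean) directly as a one-way ANOVA on the absolute deviations from each group mean. The data are synthetic (normal draws per group using the reported counts, means and standard deviations), so the results will differ from those above.

```r
set.seed(123)
counts <- c(10, 53, 681, 638, 199, 18)           # wines per quality level 3..8
means  <- c(9.96, 10.3, 9.90, 10.6, 11.5, 12.1)
sds    <- c(0.818, 0.935, 0.737, 1.05, 0.962, 1.22)

# Synthetic stand-in for the real data
wine <- data.frame(
  quality.factor = factor(rep(3:8, times = counts)),
  alcohol = rnorm(sum(counts), rep(means, times = counts), rep(sds, times = counts))
)

fit <- aov(alcohol ~ quality.factor, data = wine)

# Levene's test (center = mean): ANOVA on absolute deviations
# of each observation from its own group mean
abs_dev <- abs(wine$alcohol - ave(wine$alcohol, wine$quality.factor))
levene  <- summary(aov(abs_dev ~ wine$quality.factor))[[1]]
levene

# Shapiro-Wilk normality test on the ANOVA residuals
sw <- shapiro.test(residuals(fit))
sw
```

A small p-value in Levene's test points to unequal variances across groups; a small p-value in the Shapiro-Wilk test points to non-normal residuals. Either result is a reason to prefer a rank-based test.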

I am investigating further by using a Kruskal-Wallis rank sum test. This will allow me to test, at a .05 significance level, whether alcohol levels in lower quality wines follow the same distribution as alcohol levels in medium to higher quality wines, without assuming the data to be normally distributed.

## 
##  Kruskal-Wallis rank sum test
## 
## data:  alcohol by quality
## Kruskal-Wallis chi-squared = 398.93, df = 5, p-value < 2.2e-16
## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  rm_outliers_subset$alcohol and rm_outliers_subset$quality 
## 
##   3       4       5       6       7      
## 4 0.21614 -       -       -       -      
## 5 0.67190 0.06119 -       -       -      
## 6 0.01232 0.00888 < 2e-16 -       -      
## 7 7.8e-05 2.8e-11 < 2e-16 < 2e-16 -      
## 8 0.00134 5.8e-05 4.2e-08 0.00022 0.16378
## 
## P value adjustment method: BH
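The test and the pairwise follow-up shown above correspond to `kruskal.test()` and `pairwise.wilcox.test()` from the stats package. A minimal sketch on synthetic data follows; the data are normal draws per group based on the reported counts, means and standard deviations, which is an assumption, so the p-values will not match the output above.

```r
set.seed(123)
counts <- c(10, 53, 681, 638, 199, 18)           # wines per quality level 3..8
means  <- c(9.96, 10.3, 9.90, 10.6, 11.5, 12.1)
sds    <- c(0.818, 0.935, 0.737, 1.05, 0.962, 1.22)

# Synthetic stand-in for the real data
wine <- data.frame(
  quality.factor = factor(rep(3:8, times = counts)),
  alcohol = rnorm(sum(counts), rep(means, times = counts), rep(sds, times = counts))
)

# Global test: do the quality levels share one alcohol distribution?
kw <- kruskal.test(alcohol ~ quality.factor, data = wine)
kw

# Pairwise Wilcoxon rank sum tests with Benjamini-Hochberg adjustment
pw <- pairwise.wilcox.test(wine$alcohol, wine$quality.factor,
                           p.adjust.method = "BH")
pw
```

The Benjamini-Hochberg adjustment controls the false discovery rate across the 15 pairwise comparisons rather than testing each pair at an unadjusted 0.05 level.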

After seeing that the test is significant, I decided to apply the same method to volatile.acidity.

## 
##  Kruskal-Wallis rank sum test
## 
## data:  volatile.acidity by quality
## Kruskal-Wallis chi-squared = 218.48, df = 5, p-value < 2.2e-16
## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  rm_outliers_subset$volatile.acidity and rm_outliers_subset$quality 
## 
##   3      4       5       6       7     
## 4 0.4235 -       -       -       -     
## 5 0.0409 0.0080  -       -       -     
## 6 0.0049 5.2e-07 < 2e-16 -       -     
## 7 0.0004 5.1e-13 < 2e-16 1.7e-13 -     
## 8 0.0025 6.8e-05 5.8e-05 0.0099  0.8603
## 
## P value adjustment method: BH

Interestingly, this data reveals that among neighboring quality levels only the medium ones differ from each other. We see significant differences between 4 and 5, between 5 and 6, and between 6 and 7. In contrast, the two bottom categories (3 and 4) do not differ significantly from each other, and neither do the two top categories (7 and 8).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Overall, the relationships between the feature of interest and the other features in the dataset are rather weak. Even transforming some of the variables did not really change this impression. Alcohol has the strongest relationship with quality, meaning that higher levels of alcohol moderately correlate with higher quality ratings.

The variable volatile.acidity has the second strongest relationship with quality and among all the other variables in the dataset it is the only one left worth mentioning in the context of quality.

http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

What was the strongest relationship you found?

Alcohol and volatile.acidity have the strongest relationship with quality. With significant p-values from both Kruskal-Wallis rank sum tests, I can assume that the distributions of alcohol and volatile.acidity differ between at least some of the quality categories by more than the variation within each category would explain. Hence I conclude that there is a significant relationship between quality categories and alcohol as well as volatile.acidity.

Multivariate Plots Section

My first intention for the multivariate section was to color some of the scatterplots from above to see if quality will reveal some additional patterns.

The coloring with quality.factor.

Moreover, I was trying to explore additional variables with density, alcohol and quality.

I have also highlighted quality.factor in the colors with fixed.acidity and pH as well as citric.acid and pH.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

My idea for this part of the assignment was to add the quality factor to the existing plots I created. The coloring with quality.factor indicates a tendency for higher quality wines to have more alcohol and lower density, whereas higher density and lower levels of alcohol indicate medium to lower quality wines.

It also seems as if quality is slightly layered with fixed.acidity and density. This could mean that higher quality wines tend to have higher amounts of fixed.acidity while having slightly lower density than medium quality wines.

Moreover, I was trying to explore additional variables with density, alcohol and quality that I had originally found being minimally connected with quality from the cor table. However, the plots do not really reveal any new secrets worth following up on. I have also highlighted quality.factor in the colors with fixed.acidity and pH as well as citric.acid and pH. This, however, illustrates how neither of these relationships is tied to quality in any meaningful and visual way.

Were there any interesting or surprising interactions between features?

Unfortunately, I haven’t been able to uncover any new insights in this section that really surprised me. I can imagine that a more sophisticated familiarity with the chemical properties in the dataset could result in better insights.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I have not created any models beyond the bivariate models, because I lack the knowledge of how to choose an appropriate statistical procedure. Further investigation would be needed.


Final Plots and Summary

Plot One

Description One

I chose a box plot for my first two plots, as I believe it is the best way to visualize the most important data in this assignment. The box plot shows the distributions of alcohol within the different quality groups, along with the median, range and outliers. The width of each box reflects the number of observations it contains. We can see that most of the wines fall into categories 5 and 6. These two categories also span the widest ranges, which especially affects category 5. Overall, however, we can see a clear trend underpinning the statistical results that there is a significant difference in the distribution of alcohol among some of the categories.
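A box plot with count-dependent widths can be drawn in base R with `boxplot(..., varwidth = TRUE)`. Note that with this option the widths are proportional to the square roots of the group sizes. The report's plot was most likely drawn with ggplot, so this base R version on synthetic data (normal draws per group from the reported counts, means and standard deviations) is only an approximation of it.

```r
set.seed(123)
counts <- c(10, 53, 681, 638, 199, 18)           # wines per quality level 3..8
means  <- c(9.96, 10.3, 9.90, 10.6, 11.5, 12.1)
sds    <- c(0.818, 0.935, 0.737, 1.05, 0.962, 1.22)

# Synthetic stand-in for the real data
wine <- data.frame(
  quality.factor = factor(rep(3:8, times = counts)),
  alcohol = rnorm(sum(counts), rep(means, times = counts), rep(sds, times = counts))
)

# varwidth = TRUE scales each box width with the square root of its group size
b <- boxplot(alcohol ~ quality.factor, data = wine, varwidth = TRUE,
             xlab = "quality", ylab = "alcohol (% by volume)",
             main = "Alcohol by quality (synthetic data)")

b$n          # observations per box
b$stats[3, ] # median per quality level
```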

Plot Two

Description Two

I also chose a box plot for my second plot in order to highlight the second most important variable explaining some of the variance in quality. Just like the first box plot, it shows the distributions of the variable within the different quality groups, along with the median, range and outliers. We can see that most of the wines fall into categories 5 and 6 and that these two categories have the greatest variance. Overall, the trend is clearer than in the first plot, as the outliers are a little more evenly distributed. The plot also underpins the statistical results from the Kruskal-Wallis test, showing that there is a significant difference in the distribution of volatile.acidity among some of the categories.

Plot Three

Description Three

The third and last plot summarizes how I would want to further explore a model explaining the variance in quality. It is an attempt to combine the most important variables in the exploratory data analysis. Alcohol and quality have the strongest relationship in the analysis so far. My thought was to use this relationship and expand it with density, as density correlates well with alcohol. The less dense the wine is, the more alcohol it seems to contain, as the negative relationship suggests. I colored this relationship by the factor quality to see if this can clarify it further. It looks as if lower quality wines tend to have less alcohol, which in turn results in higher density. The size of the points in the scatterplot is determined by the level of volatile.acidity, as I was hoping it would add to the overall significance of the plot. Unfortunately, I don’t find the size by volatile.acidity to be that informative.
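A plot of this kind (density against alcohol, colored by quality, with point size driven by volatile.acidity) can be sketched in base R. Everything about the data here is an assumption made for illustration: alcohol is drawn per quality level from the reported means and standard deviations, density is simulated with a negative slope on alcohol, and volatile.acidity is drawn around the group means from the bivariate section with an assumed spread of 0.17.

```r
set.seed(123)
counts <- c(10, 53, 681, 638, 199, 18)           # wines per quality level 3..8
means  <- c(9.96, 10.3, 9.90, 10.6, 11.5, 12.1)
sds    <- c(0.818, 0.935, 0.737, 1.05, 0.962, 1.22)

# Synthetic stand-in for the real data
wine <- data.frame(
  quality.factor = factor(rep(3:8, times = counts)),
  alcohol = rnorm(sum(counts), rep(means, times = counts), rep(sds, times = counts))
)

# Assumed relationship: density falls as alcohol rises, plus noise
wine$density <- 0.9967 - 0.0009 * (wine$alcohol - 10.42) +
  rnorm(nrow(wine), sd = 0.0016)

# Assumed spread of 0.17 around the reported group means, floored at 0.12
va_means <- c(0.88, 0.69, 0.58, 0.50, 0.40, 0.42)
wine$volatile.acidity <- pmax(0.12,
  rnorm(nrow(wine), rep(va_means, times = counts), 0.17))

# One color per quality level, from low (red) to high (green)
palette_q <- colorRampPalette(c("firebrick", "gold", "darkgreen"))(
  nlevels(wine$quality.factor))

plot(wine$alcohol, wine$density,
     col = palette_q[as.integer(wine$quality.factor)],
     cex = 0.5 + 1.5 * wine$volatile.acidity,   # point size by volatile.acidity
     pch = 16,
     xlab = "alcohol (% by volume)", ylab = "density",
     main = "Density vs. alcohol by quality (synthetic data)")
legend("topright", legend = levels(wine$quality.factor),
       col = palette_q, pch = 16, title = "quality")
```

Mapping three variables onto position, color and size at once tends to overload a scatterplot, which may be why the size encoding adds little here.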

Reflection

It was a very challenging assignment for me, but I learned a lot of new things. I am especially happy about having worked so much with ggplot at this stage. I feel super comfortable using it and resolving issues I haven’t figured out about the tool yet. From an analysis perspective, I had thought that I might be able to find similarly significant results as in the diamond dataset. While I am happy with some of the findings I have presented, I feel that the report fails to deliver any surprising or outstanding results. The strongest result I was able to show within the scope of the exploratory data analysis was the influence alcohol has on quality. After all, it seems that the fun factor about wine is the biggest predictor of a good wine, probably very much to the displeasure of sommeliers. On the other hand, it became clear that the right amount of acidity is an important factor for quality as well. Too much of it tends to result in lower quality scores. With alcohol and volatile acidity being the main contributors to quality, I could imagine diving deeper in an advanced analysis. Perhaps there could be a way to merge the variables describing acidity in order to better understand their effect on quality.

I was also quite busy with my day job, so I’m really happy that I was actually able to submit this assignment, even if way past the due date. One aspect I have mentioned a lot is domain knowledge. With more time available I would try to study the variables in more detail in order to fine-tune the analysis and derive more insights.

References:

R for Data Science by Garrett Grolemund and Hadley Wickham
https://www.r-bloggers.com/
http://www.sthda.com/english/wiki/one-way-anova-test-in-r